Identifying Reddit Topics Using K-Means

Executive Summary

Reddit is an online discussion site where people come together to raise and discuss various topics. A sample of about 6,000 reddit posts was provided. The objective of this notebook was to identify the topics present among them.

The data was preprocessed by cleaning and removing duplicated entries, and the most frequently appearing terms in the post titles were identified to gain preliminary insights into what the topics might be.

Clustering was performed on the clean post titles. First, the post titles were vectorized using TF-IDF. After that, the dimensionality of the vectors was reduced using Truncated SVD (TSVD), with the criterion that the cumulative variance explained be at least 80%. K-Means clustering was then performed, and the optimum number(s) of clusters were chosen based on internal validation criteria. The clusters formed were interpreted in an attempt to identify the topic of each.

Several possible topics were identified, and are listed below:

1. Introduction

Topics

On reddit, user posts are organized by subject into boards called 'subreddits'.

Discussions on reddit

User-generated content is posted on the boards (subreddits), and discussions can be generated:

Objective

By taking sample post titles, this notebook will attempt to identify the topics present among them, and consequently, reveal the subreddits present among these titles.

2. Preliminaries

Summary of the Methodology

  1. Accessing the Data
    • Read the data
    • Determine the format
    • Store in an appropriate object type (whether numpy array, DataFrame, etc.)
  2. Preprocessing
    • Clean the data
      • Remove punctuation and 'unnecessary' characters such as emoji
      • Make all letters lowercase
      • Remove duplicated entries
  3. Perform EDA
    • Determine the most frequently appearing words to gain insights as to what the clusters might be
  4. Clustering
    • Vectorize the post titles using TF-IDF
    • Perform dimensionality reduction using Truncated SVD
    • Visualize using TSNE (optional)
    • Cluster using K-Means, from k = 2 to k = 20
      • Determine the optimum number(s) of clusters based on internal validation criteria
  5. Interpret the Results
    • Subjectively interpret the clusters formed and attempt to identify the topic of each cluster

Importing Libraries

Accessing the Data

The provided data is composed of post authors and the corresponding post titles, and is stored in the form of a .txt file.

Before working on the data, it was important to see what it looked like.

The text file was opened and its contents were read to ascertain in what format the data was stored.

The raw contents are quite difficult to read and are full of tab delimiters, indicating that the data is stored in a delimited, CSV-like format. The read_csv function of pandas was used instead, specifying the tab (\t) delimiter.
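The read step can be sketched as follows. The actual filename and column layout are assumptions based on the description above, so a small in-memory sample stands in for the file:

```python
import pandas as pd
from io import StringIO

# Hypothetical tab-delimited sample mimicking the described file:
# an unnamed index column, then author and title.
raw = "0\tuser_a\tTIL an interesting fact\n1\tuser_b\thappy new year everyone\n"

df = pd.read_csv(StringIO(raw), sep="\t", header=None,
                 names=["index", "author", "title"])
df = df.drop(columns=["index"])  # drop the unnamed column; rely on pandas' own index
```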

ANALYSIS

The file contains three fields:

  1. [index] - a row number; enclosed in brackets because this column was not named. This column was dropped in favor of the pandas index.
  2. author - username of the post author
  3. title - the title of the post, but does not include the content

The task is now to cluster the topics using the post titles. Since the data is text, a bag-of-words approach will be used. Prior to that, some exploratory analyses were performed on the post titles.

Cleaning of Text

The post titles under the title column are dirty, i.e., they contain non-alphanumeric characters (punctuation marks and emoji).

The post titles were cleaned by removing the punctuation marks and formatting all letters to lowercase. The rationale is as follows:

In reddit_mini_project/reddict_mini_project/functions.py, three functions have been defined:

The titles were cleaned and stored under the column clean_title, so that original titles in title were retained in case they would be needed.
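The exact functions in functions.py are not reproduced here; a minimal cleaning function consistent with the description (lowercasing and stripping non-alphanumeric characters) might look like:

```python
import re

def clean_text(title):
    """Lowercase a title and keep only letters, digits, and spaces."""
    title = title.lower()
    title = re.sub(r"[^a-z0-9\s]", " ", title)  # drop punctuation and emoji
    return re.sub(r"\s+", " ", title).strip()   # collapse repeated whitespace
```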

All three columns (including the author) were stored in a new DataFrame, clean_df.

Duplicated entries were identified and removed.

There were 71 extra duplicated entries.

Removing duplicates:
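With pandas, counting and removing the duplicated rows can be sketched as follows (toy data for illustration; the real data had 71 extra duplicates):

```python
import pandas as pd

# Toy stand-in for clean_df
clean_df = pd.DataFrame({
    "author": ["a", "b", "a"],
    "clean_title": ["hello world", "new year", "hello world"],
})

n_dupes = int(clean_df.duplicated().sum())  # count of extra duplicated rows
clean_df = clean_df.drop_duplicates().reset_index(drop=True)
```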

A language detection module could have been used to only select English posts, but upon using such a module, two things were discovered:

3. EDA

The focus will be on identifying the most frequently appearing words and themes.

Most Frequent Words

Before clustering, the words and themes which crop up most frequently should be identified because they might tell something about the themes present in the posts.

Performing ngrams

The following cells joined the post titles into one long string, which was then checked for the most frequently appearing words.

Since this failed to yield meaningful insights, stopwords (words which carry little or no meaning, such as articles, conjunctions, etc.) were removed in an attempt to identify the most common meaningful words.

Performing ngrams (Stopwords removed)

The following cells joined the post titles into one long string, which was then checked for the most frequently appearing words, this time with stopwords removed.

The stopword set used was the union of the stopwords from sklearn and the stopwords from the NLTK corpus. This was done with the objective of removing as much interference as possible.
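The frequency count with stopwords removed can be sketched as follows. Only sklearn's built-in stopword set is used in this sketch; in the notebook, NLTK's English stopwords were unioned in as well:

```python
from collections import Counter
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

titles = ["til the game was new", "new year new game"]  # toy titles

# The notebook used ENGLISH_STOP_WORDS | set(nltk.corpus.stopwords.words("english"))
stop_set = set(ENGLISH_STOP_WORDS)

words = [w for t in titles for w in t.split() if w not in stop_set]
top = Counter(words).most_common(2)  # most frequent non-stopword terms
```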

ANALYSIS

Getting the most frequent words with the stopwords removed helped give an idea of some of the themes present in the post titles.

Preliminary Findings:

  1. TIL
    • The top word is 'til', which should actually be capitalized as 'TIL'. 'TIL' stands for "Today I Learned", a phrase signifying a new or surprising discovery.
    • TIL may be explored in greater depth later in this report.
  2. 2016 U.S. Presidential Elections
    • The third most common word is 'trump'; although it is an ordinary English word, it more likely refers to Donald Trump, the current President of the U.S., who successfully ran for office in 2016.
    • Also present are 'sanders', and 'clinton', likely referring to Bernie Sanders and Hillary Clinton, respectively, the Democratic Party candidates for the 2016 elections.
    • The number '2016', likely referring to the year 2016, may be a nod to the elections as well.
  3. Games
    • 'Game' comes out top in the word cloud, and thus is probably a popular topic.
  4. New, and Year
    • The words 'new' and 'year' frequently appear. However, 'new' appears more frequently, implying that it is also used with other terms, not necessarily referring to the "New Year".

4. Clustering

Rationale

Internal Validation Criteria

The following internal validation criteria were used on the output of the K-Means clustering.

SSE - this metric measures the distances of the individual data points from their respective cluster centroids. The objective is to minimize the value of SSE, but at the same time, it is important to keep the clustering parsimonious (a low number of clusters). Thus, SSE represents a trade-off.

CH - the Calinski-Harabasz index (CH) is the ratio of the between-clusters dispersion mean and the within-cluster dispersion. For CH, larger values are desirable.

Intra-Inter - the ratio of the average distance over pairs $P$ of points belonging to the same cluster (intra) to the average distance over pairs $Q$ of points belonging to different clusters (inter). Since intra-cluster distances should ideally be smaller than inter-cluster distances, this metric should be minimized.

Silhouette Coefficient - let $Davg^{in}_i$ be the average distance of $x_i$ to the data points within its own cluster, and $Dmin^{out}_i$ be the smallest average distance of $x_i$ to the points of any other single cluster. The silhouette coefficient $S_i$ for the $i$th data point is: $$ S_i = \frac{Dmin^{out}_i - Davg^{in}_i}{\max\{Dmin^{out}_i, Davg^{in}_i\}} $$ The overall silhouette coefficient is the average of the $S_i$; larger (positive) values are better.
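The per-point definition above can be checked against scikit-learn's silhouette_samples on a toy example, with values chosen so the distances are easy to compute by hand:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two well-separated 1-D clusters
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

s = silhouette_samples(X, labels)
# For the first point: Davg_in = 1.0, Dmin_out = (10 + 11) / 2 = 10.5
```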

Vectorization

TF-IDF was used to vectorize the post titles:

Dimensionality Reduction

In performing dimensionality reduction, the objective is to retain at least 80% cumulative variance explained. Setting n_components too high results in much longer runtimes, while setting it too low makes meaningful results less likely.

n_components set to 3000:

ANALYSIS

As expected, increasing the number of components n yields diminishing marginal returns as n grows larger. Thus, we will proceed to use Truncated SVD with $n=3000$, since this is a bit above the minimum n at which the cumulative variance explained reaches 80%.
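The selection rule (smallest n whose cumulative explained variance reaches 80%) can be sketched on small stand-in data; the 3000-component figure comes from the real TF-IDF matrix:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((50, 20))  # stand-in for the TF-IDF matrix

svd = TruncatedSVD(n_components=19, random_state=0).fit(X)
cumvar = np.cumsum(svd.explained_variance_ratio_)
n_min = int(np.argmax(cumvar >= 0.80)) + 1  # smallest n reaching 80%
```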

TSNE was used to take a preliminary view of the data, and what the clusters might look like.
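A preliminary TSNE projection can be sketched as follows (random stand-in data; the notebook applied this to the TSVD-reduced vectors):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1337)
X = rng.random((100, 20))  # stand-in for the TSVD-reduced vectors

# Project to 2-D for plotting; perplexity must be below the sample count
emb = TSNE(n_components=2, random_state=1337, perplexity=30).fit_transform(X)
```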

K-means Clustering

We now proceed to K-means clustering. To facilitate this, more functions were defined in functions.py:

Document strings were provided for each of the functions.

The following code ran cluster_range on the TSVD-transformed vectorized data, using K-Means. A random_state was chosen in order to ensure consistency of the code across runs.
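The code for cluster_range is not reproduced here; a minimal version of the sweep it performs, collecting the internal validation criteria per k, might look like this (blob data stands in for the TSVD-reduced vectors):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Stand-in data; the notebook clustered the TSVD-reduced TF-IDF vectors
X, _ = make_blobs(n_samples=200, centers=4, random_state=1337)

results = {}
for k in range(2, 9):  # the notebook swept k = 2 to 20
    km = KMeans(n_clusters=k, random_state=1337, n_init=10)
    labels = km.fit_predict(X)
    results[k] = {
        "sse": km.inertia_,                        # minimize, but keep k low
        "ch": calinski_harabasz_score(X, labels),  # maximize
        "sil": silhouette_score(X, labels),        # maximize
    }
```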

Plotting the clusters:

ANALYSIS

By visual inspection, the clustering appears to have performed very poorly. However, the clustering operates on 3000 features, while TSNE projects that 3000-feature data onto a 2-dimensional space, so much of the structure is lost in the plot. Thus, visual inspection of the TSNE projection was discounted as a criterion for choosing the number of clusters.

The internal validation criteria were plotted below to help facilitate the choice of k.

ANALYSIS

VERDICT

Based on the criteria above, pick 7 clusters and examine the output. Seven clusters were taken instead of four because the latter might be too few to capture the possible topics. If unsatisfied, try another number of clusters, in the following order:

Round 1: K = 7

Using K-Means with n_clusters = 7, the output was visualized once again using TSNE.

ANALYSIS

From looking at the TSNE visualization, it can be seen that the clustering assigned most of the members to cluster 3 (teal cluster above). However, before discounting this clustering and trying other values of k, it might be helpful to look at the clusters formed.

The cluster labels identified by K-Means were assigned as targets to the DataFrame clean_df:
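The assignment step can be sketched as follows (toy titles; the real notebook used the fitted K-Means model from the sweep above):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

clean_df = pd.DataFrame({"clean_title": [
    "til about space", "til about history", "happy new year", "new year party",
]})
X = TfidfVectorizer().fit_transform(clean_df["clean_title"])

km = KMeans(n_clusters=2, random_state=1337, n_init=10)
clean_df["target"] = km.fit_predict(X)

# Inspect the titles per cluster to interpret the topic of each
by_cluster = clean_df.groupby("target")["clean_title"].apply(list)
```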

ANALYSIS

A subjective interpretation of the clusters that were formed at k = 7 is:

Since many possible clusters were not captured at k = 7, and since this clustering is subjectively unsatisfactory, the next round of clustering will set k to 10.

Round 2: K = 10

ANALYSIS

A subjective interpretation of the clusters that were formed at k = 10 is:

*Clusters that have appeared previously are marked with an asterisk. With this, we are implying that these are more likely to be actual clusters.

More potential clusters were captured upon setting k = 10, but with the side effect of two clusters about games. Thus, simply increasing k might not always give the best results.

Final Round: K = 8

ANALYSIS

A subjective interpretation of the clusters that were formed at k = 8 is:

*Clusters that have appeared previously are marked with an asterisk. With this, we are implying that these are more likely to be actual clusters.

Having reduced k from 10 to 8, and despite the expected loss of some of the clusters seen at k = 10, some previously unseen clusters appeared, such as "Food" and "Web Development".

5. Interpretation and Conclusion

The following clusters were identified for the three values of k used:

7 Clusters

  • Cluster 1: Games***
  • Cluster 2: New Year***
  • Cluster 3: TIL***
  • Cluster 4: Trump***
  • Cluster 5: ambiguous
  • Cluster 6: Tech Support**
  • Cluster 7: U.S. Democrats (Clinton and Sanders)***

8 Clusters

  • Cluster 1: Games***
  • Cluster 2: TIL***
  • Cluster 3: Food*
  • Cluster 4: U.S. Democrats (Clinton and Sanders)***
  • Cluster 5: Web Development*
  • Cluster 6: Trump***
  • Cluster 7: New Year***
  • Cluster 8: Market*

10 Clusters

  • Cluster 1: U.S. Democrats (Clinton and Sanders)***
  • Cluster 2: Baking and Recipes*
  • Cluster 3: Awards*
  • Cluster 4: Tech Support**
  • Cluster 5: New Year***
  • Cluster 6: Trump***
  • Cluster 7: TIL***
  • Cluster 8: Games***
  • Cluster 9: (Games)***
  • Cluster 10: Business*

*Clusters are marked with asterisks according to the number of times they have appeared among the three values of k. With this, we are implying that these are more likely to be legitimate clusters.

Summarizing the above into one list of clusters, the following guess at revealing the subreddits was made:

Higher likelihood:

High likelihood:

Other possible clusters:

6. Acknowledgments

Special thanks to Prof. Christian Alis for the notebooks, from which I derived plenty of my code.

Thanks to my LT-mates, as well as LT 1 for some good discussions as well.

To my former LT-mate, for letting me take a look at his code for K-means clustering (there were some errors in my own notebook).

Names were not mentioned in keeping with the anonymity requirement.

7. References

Websites:

Images retrieved from:

8. Appendix

Clustering all "TIL" Posts

This section is optional: it was relegated to the appendix because meaningful clusters failed to form in this analysis, and the author does not find the results meaningful.

Vectorization

Using TF-IDF to vectorize the post titles:

Dimensionality Reduction

ANALYSIS

As expected, increasing the number of components n leads to diminishing marginal returns as n grows larger. We have two choices here:

Since we do not have reliable criteria for selecting the first option, we will go with the second (hardline at 80%). Thus, use Truncated SVD with $n=2500$, since this is a bit above the minimum n at which cumulative variance explained is at 80%.

Let's use TSNE to have a preliminary view of the data, and what the clusters might look like.

K-Means

ANALYSIS

SSE - there is a downward trend in SSE, so increasing k will lower SSE. However, k cannot simply be increased indefinitely, because the clustering would no longer be parsimonious (low k) and the clusters might not make sense. Thus, k should be kept low.

CH - for Calinski-Harabasz index (CH), we want larger values. Thus, based on the plot above, we might pick 4, 5, or 8 clusters to hit a compromise at the 'elbow' of the graph.

Intra-Inter - a lower value of intra-inter is preferable. However, looking at the plot, there is no clear trend emerging from k = 2 to k = 20. Thus, we might pick 4 or 7 clusters, as they are local minima.

Silhouette - This coefficient is best if maximized. Although the variations are small (they seem large in the plot due to the y-axis limits), if we want to keep the clusters parsimonious, then we might want to pick 5 or 7 clusters.

VERDICT

Based on the criteria above, pick 4 clusters and examine the output. If unsatisfied, try 5, or 7 clusters.

Using K-Means with n_clusters = 5, we visualize once again using TSNE.

ANALYSIS

From looking at the TSNE visualization, we can see that the clustering assigned most of the members to cluster 1 (purple cluster above). However, before we discount this clustering and try something else, let us take a look at the clusters formed.

Assigning the targets, as identified by K-Means (k = 6), to the DataFrame clean_df.